
Add CUDA process checkpointing helpers#1983

Open
kkraus14 wants to merge 10 commits into NVIDIA:main from kkraus14:kk/issue-1343-cuda-checkpointing

Conversation

@kkraus14
Collaborator

@kkraus14 kkraus14 commented Apr 28, 2026

Summary

  • add a dedicated cuda.core.checkpoint module for CUDA process checkpointing APIs while keeping cuda.core.system focused on CUDA system/NVML capabilities
  • expose a narrow runtime API via checkpoint.Process(pid): read-only pid, state, restore_thread_id, lock, checkpoint, restore, and unlock
  • keep the checkpoint module public surface limited to Process; the state return type lives in cuda.core.typing.ProcessStateT and is rendered in the private API docs
  • map Process.state from the CUDA driver CUprocessState enumerators rather than raw integer values
  • support restore-time GPU UUID remapping using either driver CUuuid values or Device.uuid strings; migration docs and tests now describe the stricter kernel-mode-driver visibility requirement rather than user-space CUDA visibility
  • document the coordinator/target-process model, Linux permission requirements such as CAP_SYS_PTRACE, the CRIU/CPU-process-image boundary, restore-thread requirement, and persistence mode/cuInit restore requirement
  • validate checkpoint API availability lazily and cache the successful check, covering the cuda-bindings version, required binding symbols, and CUDA driver version
  • consolidate checkpoint driver call handling in one boundary that translates missing checkpoint symbols and unsupported checkpoint CUDA results into a checkpoint-specific RuntimeError
  • re-enable checkpoint driver coverage in CI by running driver-backed lifecycle tests in isolated coordinator/target subprocess scenarios with parent-side timeouts; migration tests skip when the CUDA device view is masked because the mapping cannot be proven KMD-complete
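The coordinator-side lifecycle summarized above can be sketched as follows. The method names (`lock`, `checkpoint`, `restore`, `unlock`) and the read-only `state` attribute come from the bullets in this summary; the `_FakeProcess` stub below is purely illustrative, standing in for a real driver-backed `checkpoint.Process(pid)` so the call order can be shown without a GPU.

```python
def checkpoint_restore_cycle(proc, gpu_mapping=None):
    """Drive a target process through the documented lifecycle:
    running -> locked -> checkpointed -> locked -> running."""
    proc.lock()                            # freeze CUDA API calls in the target
    proc.checkpoint()                      # copy GPU state into a CPU-side image
    proc.restore(gpu_mapping=gpu_mapping)  # optionally remap GPUs by UUID
    proc.unlock()                          # let the target resume running
    return proc.state


# Minimal in-memory stand-in used here only to illustrate the call order;
# a real coordinator would construct cuda.core.checkpoint.Process(target_pid).
class _FakeProcess:
    def __init__(self):
        self.state = "running"
        self.calls = []

    def lock(self):
        self.calls.append("lock")
        self.state = "locked"

    def checkpoint(self):
        self.calls.append("checkpoint")
        self.state = "checkpointed"

    def restore(self, gpu_mapping=None):
        self.calls.append("restore")
        self.state = "locked"

    def unlock(self):
        self.calls.append("unlock")
        self.state = "running"


p = _FakeProcess()
final_state = checkpoint_restore_cycle(p)
```

The stub also makes the strict state machine visible: restore returns the process to the locked state, and only unlock returns it to running.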

Closes #1343

Testing

  • git commit -S pre-commit hooks: ruff, formatting, SPDX, whitespace, RST, and related checks passed
  • git diff --check
  • pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped)
  • pixi run --manifest-path cuda_core -e docs docs-build-latest (Sphinx build succeeded)
  • previous broader local run: pixi run --manifest-path cuda_core pytest cuda_core/tests --ignore=cuda_core/tests/cython (2798 passed, 352 skipped, 2 failed)

The two Python-suite failures in the broader run are existing local NVML/system environment failures and are not related to this checkpointing change:

  • cuda_core/tests/system/test_system_device.py::test_get_inforom_version returns an empty InfoROM board part number locally.
  • cuda_core/tests/system/test_system_system.py::test_get_process_name hits an NVML UTF-8 decode error locally.

Additional local build/test notes:

  • pixi run --manifest-path cuda_core test stops before pytest in the existing build-cython-tests pre-step because cuda_core/tests/cython/test_get_cuda_native_handle.pyx cannot find the expected cuda.bindings .pxd files in this local pixi environment.
  • pixi build from cuda_core reaches the existing native cuda-core extension build and then fails with CUDA 12.9 headers that do not declare CU_MEM_ALLOCATION_TYPE_MANAGED; this is in the existing graph extension build path and is not checkpoint-specific.

CI note:

  • The previous CI attempt on 8192df67 exposed two unrelated runner/environment issues: one Windows py3.12 build failed inside the shared mini-CTK cache setup before any cuda.core build step, and the CUDA 13.x GPU test jobs were canceled after the old in-process checkpoint test hung in cuCheckpointProcessCheckpoint.
  • The current head removes the broad CI skip. Driver-backed checkpoint lifecycle tests now run through isolated subprocess coordinator/target scenarios, and the parent pytest process can kill and skip a scenario that times out instead of letting the CI job hang.

Current Test Implementation

The checkpoint tests in cuda_core/tests/test_checkpoint.py are real driver/GPU tests, not broad mocks.

Input validation and public-symbol checks run everywhere. Driver-backed lifecycle tests create a target process that initializes a real CUDA context, then a coordinator scenario calls checkpoint.Process(target.pid) and exercises state, restore_thread_id, lock, checkpoint, restore, and unlock through the real driver. The parent pytest process enforces a timeout around each scenario so unsupported driver/hardware paths skip cleanly instead of hanging the test job.
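The parent-side timeout described above can be sketched with the standard library. This is a simplified illustration, not the PR's actual harness (which the description says kills the whole scenario process group); `run_scenario` is a hypothetical helper name.

```python
import subprocess
import sys


def run_scenario(code, timeout_s=30):
    """Run a coordinator/target scenario in a child Python process.
    If a driver call stalls, kill the child and report the timeout
    instead of hanging the whole test job."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return None  # the caller turns this into a pytest skip


# A fast scenario completes normally; a hanging one is killed and yields None.
rc_ok = run_scenario("import time; time.sleep(0.1)", timeout_s=10)
rc_hang = run_scenario("import time; time.sleep(30)", timeout_s=1)
```

Returning a sentinel instead of raising keeps the decision (fail vs. skip) in the parent test, which matches the skip-on-unsupported-path behavior described above.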

Migration tests require at least two same-chip GPUs and an unmasked CUDA device view. They build full UUID mappings using Device.uuid strings, then exercise rotation and pair-swap migration patterns through Process.restore(gpu_mapping=...) in the isolated target process. They skip gracefully when CUDA_VISIBLE_DEVICES is set, when the local hardware lacks a same-chip GPU pair, or when the driver rejects/no-ops checkpoint migration.
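The rotation and pair-swap patterns mentioned above can be illustrated over plain strings. In the real tests the keys and values would be Device.uuid strings and the resulting dict is what gets passed to Process.restore(gpu_mapping=...); the "gpu-a" style values below are placeholders, not real UUIDs.

```python
def rotation_mapping(uuids):
    """Map each GPU UUID to the next one, wrapping around (rotation)."""
    return {u: uuids[(i + 1) % len(uuids)] for i, u in enumerate(uuids)}


def pair_swap_mapping(uuids):
    """Swap adjacent pairs: (0,1), (2,3), ... Each pair is assumed same-chip."""
    mapping = {}
    for i in range(0, len(uuids) - 1, 2):
        mapping[uuids[i]] = uuids[i + 1]
        mapping[uuids[i + 1]] = uuids[i]
    return mapping


uuids = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # stand-ins for Device.uuid strings
rot = rotation_mapping(uuids)
swap = pair_swap_mapping(uuids)
```

Note that both helpers produce a mapping covering every input UUID, which matters because (per the KMD-visibility rule above) a partial mapping is rejected by the driver.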

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cuda.core label (Everything related to the cuda.core module) on Apr 28, 2026
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 396a2ca to 7c66b2f on April 28, 2026 16:28
@kkraus14
Collaborator Author

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

/ok to test 7c66b2f

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch 2 times, most recently from 779c697 to 82f816c on April 28, 2026 16:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 82f816c to 25455d8 on April 28, 2026 18:22
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 25455d8 to aaf1418 on April 28, 2026 19:14
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from aaf1418 to d8a2031 on April 28, 2026 20:24
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 marked this pull request as ready for review April 29, 2026 13:59
@kkraus14 kkraus14 added the feature label (New feature or request) on Apr 29, 2026
@kkraus14 kkraus14 added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@kkraus14 kkraus14 self-assigned this Apr 29, 2026
@rparolin rparolin requested review from leofang and rparolin April 29, 2026 17:44
leofang and others added 5 commits May 2, 2026 04:06
Replace the entire mock-based test suite with real GPU tests that
exercise the CUDA driver checkpoint API directly:

- Input validation: pid type/range, public symbol checks
- Lifecycle (single GPU): state transitions at every stage
  (running→locked→checkpointed→locked→running), restore_thread_id,
  lock/unlock, lock with timeout, full checkpoint-restore cycle
- GPU migration: rotation mapping and same-chip swap following the
  r580-migration-api.c pattern; gracefully skip when the driver does
  not support migration (CUDA_ERROR_INVALID_VALUE — NVBug 5437334)

The self_process fixture wraps os.getpid() and safety-unlocks on
teardown if the test fails mid-lifecycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- checkpoint._make_restore_args now accepts UUID strings (as returned
  by Device.uuid) in addition to CUuuid objects, via a new _as_cuuuid
  helper that converts "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" strings
  to CUuuid using ctypes.
- Tests no longer import cuda.bindings.driver; all device queries use
  cuda.core.Device (Device().uuid for current device, Device.uuid for
  mapping keys/values, Device.get_all_devices() for enumeration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
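The string-to-CUuuid conversion this commit describes can be sketched with the standard-library uuid module. The exact `_as_cuuuid` signature is internal to the PR, so this only illustrates the byte-level step: a canonical UUID string becomes the 16 raw bytes a driver CUuuid struct carries.

```python
import uuid


def uuid_string_to_bytes(s):
    """Convert a canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" UUID
    string to its 16 raw bytes, the payload a CUuuid struct would hold."""
    return uuid.UUID(s).bytes


raw = uuid_string_to_bytes("12345678-1234-5678-1234-567812345678")
```

A real helper would additionally populate the ctypes CUuuid structure from these bytes, as the commit message notes.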
Ruff import sorting, ruff format, and noqa annotation for
best-effort teardown in the self_process fixture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The swap migration test calls set_current() on a different device.
Record the initial device from init_cuda and restore it on teardown
so tests are side-effect free.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator Author

@kkraus14 kkraus14 left a comment


Addressed the current CUDA checkpointing review comments in signed commit 8192df6.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Review follow-up notes for signed commit 8192df67:

GitHub accepted the inline reply for the restore-mapping xref thread, but the remaining inline reply attempts are currently returning GitHub server errors, so I am summarizing the responses here.

  • ProcessStateT: moved to cuda.core.typing, included in typing.__all__, and listed in api_private.rst. cuda.core.checkpoint imports it privately and does not expose it in checkpoint.__all__ or as checkpoint.ProcessStateT.
  • Process state mapping: updated to use the actual driver.CUprocessState.CU_PROCESS_STATE_* enumerators as keys rather than raw Python ints.
  • Checkpoint docs: rewrote the section to describe the coordinator/target PID model, show os.getpid() only as a self-checkpoint example, call out Linux permissions including CAP_SYS_PTRACE, explain the CRIU/CPU-process-image boundary, and document restore-thread plus persistence mode/cuInit requirements.
  • Tests: replaced the broad mock suite with real driver/GPU tests. The lifecycle tests self-checkpoint the pytest process through lock, checkpoint, restore, and unlock; migration tests rotate/swap full CUDA-visible Device.uuid string mappings on suitable same-chip multi-GPU systems and skip unsupported hardware/driver cases.
  • Mapping semantics: docs and tests now use full CUDA-visible mappings for migration. Tests also skip at call time when a driver exposes checkpoint symbols but returns CUDA_ERROR_NOT_SUPPORTED for actual checkpoint operations.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 4, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8192df6

@leofang
Member

leofang commented May 4, 2026

Copy-pasta from my bot, with internal info redacted.

What I did

  • Discovered GPU migration doesn't work on this machine (heterogeneous GPUs + known driver bugs)
  • Searched internal resources (Confluence, gdrive, NVBugs, P4) to understand the root cause
  • Rewrote the full test suite with real GPU tests and no mocks
  • Pushed 4 commits to Keith's PR branch

Test suite structure (13 tests)

  • 5 input validation (no GPU needed): invalid pid types/values, public symbols
  • 6 lifecycle (single GPU, real driver): state transitions at every step, restore_thread_id, lock/unlock with timeouts, full checkpoint/restore cycle
  • 2 migration (≥2 same-chip GPUs): rotation and swap — exercise real driver API, skip gracefully when unsupported

Lessons learned

  1. nvidia-smi order ≠ CUDA device order. We wasted time testing the wrong GPU pair because CUDA_VISIBLE_DEVICES=0,3 selected Ada + A6000 (different chips) instead of two A6000s. Always use Device.uuid to identify GPUs, never assume nvidia-smi indices match CUDA indices.

  2. CUDA_VISIBLE_DEVICES breaks checkpoint migration. Tested locally: even a 2-pair identity mapping fails with CUDA_ERROR_INVALID_VALUE when only 2 of 4 GPUs are visible, while the same 4-pair identity mapping succeeds with all GPUs visible. The CUDA driver docs state the mapping must cover "every checkpointed GPU," and the driver records all physically attached GPUs at checkpoint time — not just the CUDA_VISIBLE_DEVICES subset.

  3. Migration requires exact chip match. Tested locally: any mapping that remaps between different architectures (e.g. Ada ↔️ A6000) is rejected with CUDA_ERROR_INVALID_VALUE. Same-chip mappings (A6000 ↔️ A6000) are accepted. The public CUDA docs say "the GPU to restore onto needs to be of the same chip type as the old GPU."

  4. Migration may be a no-op on some driver versions. Tested locally: same-chip A6000 swap is accepted (no error) but the context device UUID doesn't change — the driver silently no-ops. This is consistent with NVBug 5437334 (api_reverse_gpu_pairs fails on GA100x4) and NVBug 5544504 (customer report: "restore on a different GPU — the checkpointed process does not appear in the GPU process list").

  5. Checkpoint state machine is strict. Can't unlock from "checkpointed" state — must restore first (CUDA_ERROR_ILLEGAL_STATE). Can't corrupt GPU memory between checkpoint and restore within the same process (CUDA APIs are frozen). The overwrite-then-restore scenario requires an external coordinator (CRIU).

  6. Use cuda.core APIs in tests, not raw bindings. Device().uuid for current device, Device.uuid for mapping keys, Device.get_all_devices() for enumeration. The implementation (checkpoint.py) handles the string-to-CUuuid conversion internally.
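Selecting a same-chip GPU pair, as lessons 1 and 3 recommend, can be sketched by grouping devices on their chip name. The (name, uuid) tuples below are stand-ins for cuda.core Device objects; real code would pull both fields from Device, never from nvidia-smi ordering.

```python
from collections import defaultdict


def find_same_chip_pair(devices):
    """Group (chip_name, uuid) tuples by chip and return the first pair
    of UUIDs that share a chip, or None when no such pair exists."""
    by_chip = defaultdict(list)
    for name, dev_uuid in devices:
        by_chip[name].append(dev_uuid)
    for uuids in by_chip.values():
        if len(uuids) >= 2:
            return uuids[0], uuids[1]
    return None


# Placeholder inventory mirroring the mixed Ada/A6000 machine described above.
devices = [
    ("NVIDIA RTX 6000 Ada", "gpu-0"),
    ("NVIDIA RTX A6000", "gpu-1"),
    ("NVIDIA RTX A6000", "gpu-3"),
]
pair = find_same_chip_pair(devices)
```

Grouping on the device name is a heuristic for "same chip"; a stricter check could compare compute capability or PCI device IDs as well.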

@leofang
Member

leofang commented May 4, 2026

Pushed e9c03de because the real tests hang in the CI...

@leofang
Member

leofang commented May 4, 2026

/ok to test e9c03de

@leofang leofang requested review from Andy-Jost and leofang May 4, 2026 17:35
leofang and others added 2 commits May 4, 2026 16:49
cuCheckpointProcessCheckpoint hangs on CI runners (ephemeral VM +
container), causing all CUDA 13.x test jobs to time out.  Skip the
tests that call into the checkpoint driver when the CI environment
variable is set.  Input validation tests still run everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from e9c03de to 8f798f4 on May 4, 2026 20:50
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8f798f4

@leofang
Member

leofang commented May 4, 2026

@kkraus14 I noticed your force-push (to stringify the tests for subprocess) still includes my WAR (checking the CI env var), so the tests are still not run in the CI environment. Is that intended?

Member

@leofang leofang left a comment


Since I also made some changes, would be nice for @Andy-Jost to re-review 🙂

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Addressed Leo’s latest review comments in signed commit 376acc7f18.

On CI: the remaining CI skip was a temporary workaround, not the intended final state. I removed the broad CI skip after moving all driver-backed lifecycle tests onto the isolated subprocess coordinator/target harness. The parent pytest process enforces timeouts and kills the scenario process group on timeout, so CI should get checkpoint coverage without repeating the previous job-level hang. Migration tests still skip when CUDA_VISIBLE_DEVICES is set or when the hardware/driver cannot provide valid same-chip migration, because the restore mapping must cover the KMD-visible GPU set.

Other updates:

  • Process.pid is now read-only via private _pid storage plus a property, with test coverage.
  • _call_driver now owns checkpoint missing-symbol and unsupported-result translation in one place.
  • Docs and migration tests now describe/use the KMD visibility rule instead of the looser CUDA-visible wording.
  • The PR description has been updated for the current implementation and validation state.

Local validation passed: ruff, git diff --check, focused pytest (10 passed, 6 skipped), and docs-build-latest.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 376acc7


Labels

  • cuda.core: Everything related to the cuda.core module
  • feature: New feature or request
  • P1: Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CUDA Checkpointing

4 participants